Supervised Learning - Foundations: ReCell

By: Sushma Rao

Context

Buying and selling used phones and tablets used to be something that happened on a handful of online marketplace sites. But the used and refurbished device market has grown considerably over the past decade, and a new IDC (International Data Corporation) forecast predicts that the used phone market would be worth \$52.7bn by 2023 with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used phones and tablets that offer considerable savings compared with new models.

Refurbished and used devices continue to provide cost-effective alternatives to both consumers and businesses that are looking to save money when purchasing a device. There are plenty of other benefits associated with the used device market. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase. Third-party vendors/platforms, such as Verizon, Amazon, etc., provide attractive offers to customers for refurbished devices. Maximizing the longevity of devices through second-hand trade also reduces their environmental impact and helps in recycling and reducing waste. The impact of the COVID-19 outbreak may further boost this segment as consumers cut back on discretionary spending and buy phones and tablets only for immediate needs.

Objective

The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.

Data Description

The data contains the different attributes of used/refurbished phones and tablets. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

Data Overview

Observations:

  1. brand name, os, 4g and 5g are of type object.
  2. screen size, main camera mp, selfie camera mp, int memory, ram, battery, weight, new price and used price are of type float.
  3. release year and days used are of type integer.
  4. A few columns contain missing values.

Observation: There are no duplicated values.

Observations: main camera mp, selfie camera mp, int memory, ram, battery and weight have missing values. main camera mp has the highest number of missing values.

Observations:

  1. The average used_price is 92.3 euros.
  2. Missing values are recorded as NaN.
  3. There are no unusual data entries.
  4. There are 34 unique brand names and 4 different operating systems (os).
  5. The most frequent brand name entry is "Others".
  6. Android is the most common os.

Exploratory Data Analysis (EDA)

Questions:

  1. What does the distribution of used device prices look like?
  2. What percentage of the used device market is dominated by Android devices?
  3. The amount of RAM is important for the smooth functioning of a device. How does the amount of RAM vary with the brand?
  4. A large battery often increases a device's weight, making it feel uncomfortable in the hands. How does the weight vary for phones and tablets offering large batteries (more than 4500 mAh)?
  5. Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones and tablets are available across different brands with a screen size larger than 6 inches?
  6. Budget devices nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of budget devices offering greater than 8MP selfie cameras across brands?
  7. Which attributes are highly correlated with the price of a used device?

1. What does the distribution of used price look like?

observations:

1. The distribution of used price looks right skewed.
2. There are a few outliers.
3. The mean is greater than the median.
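The link between right skew and a mean above the median can be checked numerically. The following is a minimal sketch using only the standard library; the price list is made up for illustration and is not taken from the ReCell data.

```python
import statistics

# Hypothetical used prices in euros: most devices are cheap, a few are expensive.
prices = [40, 55, 60, 65, 70, 75, 80, 90, 110, 150, 320, 640]

mean_price = statistics.mean(prices)
median_price = statistics.median(prices)


def sample_skewness(values):
    """Moment-based skewness: positive when the right tail is longer."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


print(f"mean={mean_price:.2f}, median={median_price:.2f}, "
      f"skewness={sample_skewness(prices):.2f}")
```

A few expensive devices pull the mean upward while the median barely moves, which is the pattern described in the observations.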

Univariate Analysis of the Data

Observations: The distribution of new price is right-skewed with outliers and looks similar to that of used price.

Observations: The plot shows many outliers. Because the used devices can be either phones or tablets, screen size naturally varies widely, so I will keep all the data points; dropping them might alter model performance.

Observations: The distribution is right-skewed.

Observations: The distribution is skewed.

Observation: The distribution is clearly skewed.

Observation: Most of the devices have 4 GB of RAM. The plot also shows some outliers.

Observations: The distribution is positively skewed. The heaviest devices appear to be the used tablets, and they are few in number.

Observations: Most devices have a battery capacity below 4000 mAh. The plot shows outliers.

Observations: The distribution looks left-skewed.

2. What percentage of the used device market is dominated by Android devices?

Observations:

1. The labeled bar chart shows that the used device market is dominated by the Android os.

2. 93.1% of the used devices run Android.
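A share like this is a normalized frequency count of the os column. Here is a small sketch with a made-up column, so the percentages differ from the 93.1% above; with pandas the same result would come from `Series.value_counts(normalize=True)`.

```python
from collections import Counter

# Hypothetical os column (the real data has four operating systems).
os_column = ["Android"] * 27 + ["Others", "Windows", "iOS"]

counts = Counter(os_column)
total = sum(counts.values())

# Percentage share of each operating system, most common first.
share = {name: 100 * n / total for name, n in counts.most_common()}

for name, pct in share.items():
    print(f"{name}: {pct:.1f}%")
```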

Univariate Analysis of Categorical Data

Observations: The most frequent brand name entry is "Others": nearly 14.5% of the devices do not have a proper brand name.

Samsung and Huawei have the largest shares among the named brands.

Observations:

1. Only 4.4% of the devices have 5g availability.

2. Nearly 68% of the devices have 4g.

Observation: Most of the used devices were released in 2013 and 2014.

Bivariate Analysis

Observations:

  1. used price is positively correlated with new price.
  2. screen size is positively correlated with battery and weight.
  3. battery and weight are also positively correlated.
  4. selfie camera mp is positively correlated with used price and negatively correlated with days used.

Observations: There appears to be a roughly linear relationship between used price and new price, and also between battery and weight.

3. The amount of RAM is important for the smooth functioning of a device. How does the amount of RAM vary with the brand?

Observations:

1. OnePlus offers the most RAM, while Celkon offers the least.

2. Nokia and Infinix also offer less RAM than most brands, though more than Celkon.

3. All other brands offer roughly the same amount of RAM, which is decent compared to Celkon, Nokia and Infinix.

4. OnePlus is the brand to consider for anyone looking for more RAM.

4. A large battery often increases a device's weight, making it feel uncomfortable in the hands. How does the weight vary for phones and tablets offering large batteries (more than 4500 mAh)?

Observations: There are 341 devices whose battery capacity is more than 4500 mAh.

Observations: The output shows that as battery capacity increases, the weight of the device also increases.

Observations: Google devices weigh the most, followed by Lenovo, Apple and Sony. Micromax devices weigh the least among the given brands.

Observations:

1. Google, HTC, Micromax, Nokia, Spice and Oppo each have the same type of device, hence the black lines on their bars are missing.

2. The Google device weighs the most; it may be a tablet.

3. The Micromax device weighs the least; it may be a mobile phone.

5. Bigger screens are desirable for entertainment purposes as they offer a better viewing experience. How many phones and tablets are available across different brands with a screen size larger than 6 inches?

Observations: There are 1099 devices available with a screen size larger than 6 inches.

Observations:

1. Huawei has the most devices with a screen size larger than 6 inches, followed by Samsung.

2. Microsoft has only one device with a screen size larger than 6 inches.

3. Huawei provides customers with the bigger screens that are desirable for entertainment purposes, as they offer a better viewing experience.

6. Budget devices nowadays offer great selfie cameras, allowing us to capture our favorite moments with loved ones. What is the distribution of budget devices offering greater than 8MP selfie cameras across brands?

Observation:

The mean new price of the devices is around 238 euros. To find the distribution of budget devices, our first step is to cut the data set into budget-friendly, mid-ranger and premium categories.
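The binning step might look like the sketch below. The cut-off values (200 and 350 euros), the function name and the sample rows are all assumptions chosen for illustration; the notebook's actual thresholds are not stated in the text.

```python
def price_category(new_price, budget_max=200, mid_max=350):
    """Map a new price in euros to a category.

    The cut-offs are illustrative assumptions, not the notebook's values.
    """
    if new_price <= budget_max:
        return "Budget"
    if new_price <= mid_max:
        return "Mid-ranger"
    return "Premium"


# Hypothetical (brand, new price, selfie camera MP) rows.
devices = [
    ("Xiaomi", 150, 13),
    ("Huawei", 180, 16),
    ("Apple", 900, 12),
    ("Realme", 120, 8),
    ("Samsung", 300, 10),
]

# Budget devices whose selfie camera is strictly greater than 8 MP.
budget_good_selfie = [
    brand
    for brand, price, selfie_mp in devices
    if price_category(price) == "Budget" and selfie_mp > 8
]
print(budget_good_selfie)
```

Counting the brands in the filtered list then gives the per-brand distribution asked for in the question.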

Observations: Budget-friendly devices make up a larger percentage than premium devices.

Let's check the distribution for the Budget device category.

Observations:

1. We have divided the devices into budget, mid-ranger and premium categories based on the new price.

2. There are more budget-friendly devices than premium ones.

3. We have created a data set of devices with a selfie camera of more than 8 MP.

4. Both count plots show that Xiaomi is the brand with the most budget-friendly devices offering a selfie camera greater than 8 MP, followed by Huawei and Realme.

5. The distribution plot shows that the distribution is not normal.

7. Which attributes are highly correlated with the price of a used device?

Observations:

1. used price is highly correlated with new price and selfie camera mp.

2. The price of a used device tends to be higher if its price when new was higher.

3. A good selfie camera also appears to be one of the important factors customers look for when they buy used devices.

4. RAM, battery and screen size may also play a role, after selfie camera mp and new price.

Bivariate Analysis of Main Camera

Observation: There is no clear relationship between brand name and main camera mp.

Observation: The plot shows no sign of a linear relationship between main camera mp and used price.

Observation: The plot clearly shows that the price of used devices increases as the release year gets closer to the present. Devices from 2013 have the lowest price and devices from 2020 the highest.

Observation: Devices with 4g are priced higher than those without 4g.

Observation: Devices with 5g are priced higher than those without 5g.

Observation: There is no clear relationship between days used and the price of the used device.

Observations:

1. Only Android devices have both 4g and 5g.

2. All of these devices were released recently, in 2019 and 2020.

Observations: The price of used devices is higher for the iOS operating system.

Observation: The price of used devices is higher with more RAM.

Data Preprocessing

Missing Value Treatment (Imputation of missing Values)

Observations and necessary steps to be taken

1. There are missing values.
2. Let's check the distribution of the variables with missing data.
3. If a distribution is skewed, we will replace its missing values with the median.
4. If a distribution is normal, we will replace its missing values with the mean.

Observations:

1. All the attributes with missing values are skewed, so we can replace the missing values in each with its median.
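The imputation rule above can be sketched in a few lines. This version uses plain Python lists with `None` standing in for NaN; the column values and the `impute_median` helper are hypothetical.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical weight column (grams) with two missing entries and one heavy tablet.
weight = [140.0, 150.0, None, 165.0, 480.0, None, 155.0]
weight_filled = impute_median(weight)

print(weight_filled)
```

For this column the median of the observed values is 155 g while the mean is 218 g, which illustrates why the median is the safer fill value for skewed attributes: a single heavy tablet does not drag it upward.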

Let's check for missing values again.

Observation: There are no missing values remaining.

Feature Engineering

Let's apply a log transformation to used price and new price, as both are positively skewed and correlated with each other.
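The effect of the log transformation on skewness can be illustrated as follows. The prices are made-up values with a long right tail, and the natural log is an assumption; the notebook may have used a different base or a `log1p` variant.

```python
import math


def skewness(values):
    """Moment-based skewness of a list of numbers."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical new prices in euros with a long right tail.
new_price = [90, 120, 150, 180, 220, 260, 400, 700, 1200, 2500]
new_price_log = [math.log(p) for p in new_price]

print(f"skewness before: {skewness(new_price):.2f}")
print(f"skewness after:  {skewness(new_price_log):.2f}")
```

The transformation is invertible with `math.exp`, so predictions made on the log scale can be converted back to euros.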

Observations: The distribution looks almost normal, with some outliers.

Observation: The distribution of new price log looks almost normal, with some outliers.

We have to deal with the categorical variables before building the linear regression model. We will create dummy variables for all the categorical independent features.
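Dummy encoding turns each category level into a 0/1 column. Below is a dependency-free sketch; the `one_hot` helper is hypothetical, and dropping the first level as a baseline is an assumption (with pandas this corresponds to `pd.get_dummies(..., drop_first=True)`).

```python
def one_hot(values, prefix, drop_first=True):
    """Return {column_name: list of 0/1 flags} for a categorical column."""
    levels = sorted(set(values))
    if drop_first:
        # The dropped level becomes the baseline absorbed by the intercept.
        levels = levels[1:]
    return {
        f"{prefix}_{level}": [int(v == level) for v in values]
        for level in levels
    }


# Hypothetical os column.
os_column = ["Android", "iOS", "Android", "Windows", "Others"]
dummies = one_hot(os_column, prefix="os")

for name, flags in dummies.items():
    print(name, flags)
```

Dropping one level per categorical feature avoids perfect collinearity between the dummy columns and the constant term.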

Observation: There are now 53 columns after creating dummy variables.

Outlier Treatment

Outlier detection can be done using

  1. boxplot
  2. z_score
  3. IQR

Let's check for outliers in the data set using boxplot and z_score.

Observations:

  1. days used has no outliers; there are no data points beyond the whiskers.
  2. int memory, weight and battery have many outliers. Let's check for outliers in the other variables using z_score.
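A z-score check flags points that lie many standard deviations from the mean. The sketch below assumes the common threshold of 3 and uses a made-up battery column; neither is confirmed by the notebook text.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs((v - mu) / sigma) > threshold]


# Hypothetical battery capacities in mAh with one very large value.
battery = [3000, 3100, 2900, 3200, 3050, 2950, 3150,
           3000, 3100, 2800, 3250, 3000, 9000]

battery_outliers = zscore_outliers(battery)
print(battery_outliers)
```

Because extreme values also inflate the standard deviation, the z-score method can miss outliers in small or heavily skewed samples.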

Using IQR method
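The IQR rule marks values outside the fences Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, which is the same rule a boxplot uses for its whiskers. This is a minimal sketch with a made-up weight column; the multiplier 1.5 is the conventional default and is assumed here.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical device weights in grams; 480 g would be a tablet.
weight = [130, 140, 145, 150, 155, 160, 165, 170, 180, 480]

lower, upper = iqr_bounds(weight)
weight_outliers = [w for w in weight if w < lower or w > upper]

print(f"fences: ({lower:.1f}, {upper:.1f}), outliers: {weight_outliers}")
```

Unlike the z-score, the fences depend only on the quartiles, so a single extreme value does not shift them much.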

Observations:

Outlier detection using boxplot, z_score and IQR is done.

We have removed the outliers for battery and weight.

Let's check the model performance without outliers.

EDA

Let's explore the data again after the missing value imputation and feature engineering.

Observation: The plots show the exploratory data analysis of the data set after missing value treatment and feature engineering. The dependent variable is correlated with one or more independent variables, which we will attend to during model building.

Building a Linear Regression model

Data Preparation for modeling

We want to predict the used device price, so we will use the log-transformed version, used_price_log, as the target for modeling.

We'll split the data into train and test to be able to evaluate the model.
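A reproducible random split can be sketched as follows. The 70/30 ratio and the seed are assumptions, since the text does not state which values were used; scikit-learn's `train_test_split` is the usual library routine for this step.

```python
import random


def train_test_split(rows, test_size=0.3, seed=1):
    """Shuffle a copy of the rows with a fixed seed and split it in two."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical row indices standing in for the prepared data set.
rows = list(range(100))
train, test = train_test_split(rows)

print(len(train), len(test))
```

Fixing the seed keeps the split identical across runs, so changes in the metrics reflect changes in the model rather than in the sample.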

Linear Regression using statsmodels

Let's build a linear regression model using statsmodels.

Observations:

1. A negative coefficient suggests that as that independent variable increases, the dependent variable (used price log) tends to decrease.
2. The used price increases with an increase in screen size, main camera mp, selfie camera mp, int memory, ram and new price.

Model performance evaluation

Observations:

The R-squared dropped from 0.85 to 0.82, which shows that the performance of the model is not so good.

To drop or not to drop the outliers

1. We are trying to build a linear regression model for the used devices.

2. Devices can be phones or tablets.

3. The devices were released between 2013 and 2020.

4. The devices can be smart or simple, and they belong to different brands with different specifications.

5. The data can vary from device to device.

6. There are no extreme or incorrectly measured data points.

Let's check the performance without dropping the outliers.

Linear Regression using statsmodels

Let's build a linear regression model using statsmodels.

Model performance evaluation

Observations:

1. The R-squared did not drop drastically.

2. The training $R^2$ is 0.845, so the model is not underfitting.

3. The train and test RMSE and MAE are comparable, so the model is not overfitting either.

4. A MAPE of 4.48 on the test data means that, on average, predictions are within 4.48% of the actual target values (the modeled target is the log of used price).
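For reference, the four metrics discussed above can be computed as in this sketch. The `regression_metrics` helper and the small actual/predicted lists are hypothetical; they only demonstrate the formulas.

```python
import math


def regression_metrics(actual, predicted):
    """Return RMSE, MAE, MAPE (in percent) and R-squared."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)                   # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    return {
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
        "R2": 1 - ss_res / ss_tot,
    }


# Hypothetical log-scale targets and predictions.
actual = [4.0, 4.5, 5.0, 5.5]
predicted = [4.2, 4.4, 5.1, 5.3]
metrics = regression_metrics(actual, predicted)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that MAPE is relative to the scale of the values passed in: computed on log prices it describes percentage error in the log price, not in euros.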

Checking Linear Regression Assumptions

We will be checking the following Linear Regression assumptions:

No Multicollinearity

Linearity of variables

Independence of error terms

Normality of error terms

No Heteroscedasticity

TEST FOR MULTICOLLINEARITY

Multicollinearity occurs when predictor variables in a regression model are correlated. This correlation is a problem because predictor variables should be independent. If the correlation between variables is high, it can cause problems when we fit the model and interpret the results. When we have multicollinearity in the linear model, the coefficients that the model suggests are unreliable.

There are different ways of detecting (or testing) multicollinearity. One such way is by using the Variance Inflation Factor, or VIF.

Variance Inflation Factor (VIF): Variance inflation factors measure the inflation in the variances of the regression parameter estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient $\beta_k$ is "inflated" by the existence of correlation among the predictor variables in the model.

If VIF is 1, then there is no correlation among the $k$th predictor and the remaining predictor variables, and hence, the variance of $\beta_k$ is not inflated at all.

General Rule of thumb:

If VIF is between 1 and 5, then there is low multicollinearity.
If VIF is between 5 and 10, we say there is moderate multicollinearity.
If VIF exceeds 10, it shows signs of high multicollinearity.
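The VIF of predictor $k$ equals $1 / (1 - R_k^2)$, where $R_k^2$ comes from regressing that predictor on all the others. The sketch below implements this definition with a small pure-Python least-squares solver; the helper names and the simulated screen size, weight and days used columns are assumptions. In a statsmodels workflow the usual routine is `variance_inflation_factor` from `statsmodels.stats.outliers_influence`.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def r_squared(y, columns):
    """R-squared of an OLS fit of y on the given columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    XtX = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)]
           for j in range(p)]
    Xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    beta = solve(XtX, Xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def vif(columns, k):
    """Variance inflation factor of column k given the other columns."""
    others = [c for j, c in enumerate(columns) if j != k]
    return 1 / (1 - r_squared(columns[k], others))


# Simulated predictors: weight is almost a linear function of screen size,
# while days used is independent of both.
rng = random.Random(0)
screen = [rng.uniform(4, 12) for _ in range(200)]
weight = [20 * s + rng.gauss(0, 5) for s in screen]
days = [rng.uniform(100, 1000) for _ in range(200)]
columns = [screen, weight, days]

for name, k in [("screen size", 0), ("weight", 1), ("days used", 2)]:
    print(f"{name}: VIF = {vif(columns, k):.2f}")
```

In this simulation the two collinear columns get a large VIF while the independent column stays close to 1, matching the rule of thumb above.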

Observations:

1. screen size, weight, brand name Huawei, brand name Others and brand name Samsung have a VIF greater than 5.

2. As screen size increases, the weight of the device also increases; the two are highly correlated.

3. brand name Others has the maximum VIF, followed by screen size. Let's remove them one at a time and check the VIF again.

Removing Multicollinearity

To remove multicollinearity:

1. Drop, one at a time, every column that has a VIF score greater than 5.
2. Look at the adjusted R-squared and RMSE of each of these models.
3. Drop the variable that makes the least change in adjusted R-squared.
4. Check the VIF scores again.
5. Continue until all VIF scores are under 5.

Let's define a function that will help us do this.

Observation:

1. Dropping screen size and weight can impact the model performance.
2. Let's drop brand name Apple, since it has the maximum VIF, and check the VIF again.

After dropping brand name Apple, the VIF of os iOS reduced, so it will not be dropped from the model. Let's drop brand name Others next.

Let's drop brand name Others, as it has the greatest VIF compared to the rest.

Observation: After dropping brand name Others, only screen size and weight have VIF > 5.

Let's drop screen size and check.

Observations: The VIFs of all the feature variables are under 5, so the multicollinearity has been handled.

Interpreting the Regression Results:

Adjusted R-squared: It reflects the fit of the model.
    Adjusted R-squared values generally range from 0 to 1, where a higher value generally indicates a better fit,
    assuming certain conditions are met.
    In our case, the value for adj. R-squared is 0.839, which is good!

const coefficient: It is the Y-intercept.
    It means that if all the predictor variables are zero, then the expected output (i.e., Y) would be equal to the const coefficient.
    In our case, the value for const coefficient is -60.98.

std err: It reflects the level of accuracy of the coefficients. Here it is 9.107.
    The lower it is, the higher is the level of accuracy.

P>|t|: It is p-value.

    For each independent feature, there is a null hypothesis and an alternate hypothesis.
    Here $\beta_i$ is the coefficient of the $i$th independent variable.

$H_0$: Independent feature is not significant ($\beta_i = 0$)

$H_a$: Independent feature is significant ($\beta_i \neq 0$)

    (P>|t|) gives the p-value for each independent feature to check that null hypothesis. We are considering 0.05 (5%) as significance level.
        A p-value of less than 0.05 is considered to be statistically significant.

Confidence Interval: It represents the range in which our coefficients are likely to fall (with a likelihood of 95%).
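Two of the quantities described above can be reproduced by hand. The sketch below computes adjusted R-squared from R-squared and builds an approximate 95% confidence interval as coefficient plus or minus z times std err. All input numbers are made up, and the normal quantile is an approximation: regression software typically uses the t distribution, which is nearly identical for large samples.

```python
from statistics import NormalDist


def adjusted_r2(r2, n_obs, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def normal_ci(coef, std_err, level=0.95):
    """Approximate confidence interval using a normal quantile."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return coef - z * std_err, coef + z * std_err


# Hypothetical model summary values.
adj = adjusted_r2(r2=0.85, n_obs=1000, n_predictors=40)
low, high = normal_ci(coef=0.50, std_err=0.02)

print(f"adjusted R-squared: {adj:.4f}")
print(f"95% CI: ({low:.4f}, {high:.4f})")
```

An interval that excludes zero corresponds to a p-value below 0.05 for that coefficient, which ties the confidence interval back to the P>|t| column.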

Observations:

1. We can see that adj. R-squared has dropped from 0.842 to 0.8389, which shows that the dropped columns did not have much effect on the model.
As there is no multicollinearity, we can look at the p-values of predictor variables to check their significance.

Dealing with the p-values

Observations:

1. Now there are no variables with a p-value > 0.05.

2. The performance of the model has also not changed much, but the std err has reduced.

3. The variables we dropped were not affecting the model much.

Now we'll check the rest of the assumptions on olsmod2.

Linearity of variables

Independence of error terms

Normality of error terms

No Heteroscedasticity

TEST FOR LINEARITY AND INDEPENDENCE

Why the test?

Linearity describes a straight-line relationship between two variables; predictor variables must have a linear relation with the dependent variable.
The independence of the error terms (or residuals) is important. If the residuals are not independent, then the confidence intervals of the coefficient estimates will be narrower and make us incorrectly conclude a parameter to be statistically significant.

How to check linearity and independence?

Make a plot of fitted values vs residuals.
If they don't follow any pattern, then we say the model is linear and residuals are independent.
Otherwise, the model is showing signs of non-linearity and residuals are not independent.

How to fix if this assumption is not followed?

We can try to transform the variables and make the relationships linear.

Observation: We see no pattern in the plot of fitted values vs residuals. Hence the assumptions of linearity and independence are satisfied.

TEST FOR NORMALITY

Why the test?

Error terms, or residuals, should be normally distributed. If the error terms are not normally distributed, confidence intervals of the coefficient estimates may become too wide or narrow. Once confidence interval becomes unstable, it leads to difficulty in estimating coefficients based on minimization of least squares. Non-normality suggests that there are a few unusual data points that must be studied closely to make a better model.

How to check normality?

The shape of the histogram of residuals can give an initial idea about the normality.
It can also be checked via a Q-Q plot of residuals. If the residuals follow a normal distribution, they will make a straight line plot, otherwise not.
Other tests to check for normality include the Shapiro-Wilk test.
    Null hypothesis: Residuals are normally distributed
    Alternate hypothesis: Residuals are not normally distributed
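The Q-Q idea can be sketched without plotting: sort the standardized residuals, pair them with theoretical normal quantiles, and measure how close the pairs are to a straight line. The `qq_correlation` helper and the simulated residuals are assumptions for illustration; a formal test such as Shapiro-Wilk is available as `scipy.stats.shapiro`.

```python
import random
from statistics import NormalDist, mean, pstdev


def qq_correlation(residuals):
    """Correlation between sorted standardized residuals and normal quantiles.

    Values close to 1 mean the Q-Q points lie near a straight line.
    """
    n = len(residuals)
    mu, sigma = mean(residuals), pstdev(residuals)
    observed = sorted((r - mu) / sigma for r in residuals)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    mo, mt = mean(observed), mean(theoretical)
    cov = sum((o - mo) * (t - mt) for o, t in zip(observed, theoretical))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_t = sum((t - mt) ** 2 for t in theoretical)
    return cov / (var_o * var_t) ** 0.5


# Simulated residuals: one roughly normal sample, one right-skewed sample.
rng = random.Random(42)
normal_resid = [rng.gauss(0, 0.2) for _ in range(500)]
skewed_resid = [rng.expovariate(1.0) for _ in range(500)]

print(f"normal residuals: {qq_correlation(normal_resid):.4f}")
print(f"skewed residuals: {qq_correlation(skewed_resid):.4f}")
```

Deviations in the tails lower this correlation, which is the numerical counterpart of Q-Q points bending away from the line.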

How to fix if this assumption is not followed?

We can apply transformations like log, exponential, arcsinh, etc. as per our data.

The residuals more or less follow a straight line except for the tails.
Let's check the results of the Shapiro-Wilk test.
Since p-value < 0.05, the residuals are not normal as per the Shapiro-Wilk test.
Strictly speaking, the residuals are not normal.
However, as an approximation, we can accept this distribution as close to being normal.
So, the assumption is satisfied.

TEST FOR HOMOSCEDASTICITY

Homoscedasticity: If the variance of the residuals is symmetrically distributed across the regression line, then the data is said to be homoscedastic.

Heteroscedasticity: If the variance is unequal for the residuals across the regression line, then the data is said to be heteroscedastic.

Why the test?

The presence of non-constant variance in the error terms results in heteroscedasticity. Generally, non-constant variance arises in presence of outliers.

How to check for homoscedasticity?

The residual vs fitted values plot can be looked at to check for homoscedasticity. In the case of heteroscedasticity, the residuals can form an arrow shape or any other non-symmetrical shape.
The Goldfeld-Quandt test can also be used. If we get a p-value > 0.05 we can say that the residuals are homoscedastic. Otherwise, they are heteroscedastic.
    Null hypothesis: Residuals are homoscedastic
    Alternate hypothesis: Residuals have heteroscedasticity

How to fix if this assumption is not followed?

Heteroscedasticity can be fixed by adding other important features or making transformations.

Since p-value > 0.05, we can say that the residuals are homoscedastic. So, this assumption is satisfied.

Now that we have checked all the assumptions of linear regression and they are satisfied, let's go ahead with prediction.

We can observe here that our model has returned pretty good prediction results, and the actual and predicted values are comparable. We can also visualize the comparison as a bar graph. Note: As the number of records is large, for representation purposes, we are taking a sample of 25 records only.

Observations: The model has performed well; the difference between actual and predicted values is small.

Final Model Summary

Actionable Insights and Recommendations

OBJECTIVE AND DATA ANALYSIS STEPS.

1. The data (used device data) contains the different attributes of used/refurbished phones and tablets.

2. We want to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.

3. The data set was read into a Jupyter notebook and the necessary initial analysis was made, such as finding the shape of the data and its statistical description.

4. The missing values were treated, the necessary feature engineering was done, and outliers were detected.

5. Finally, we built a linear regression model using OLS (Ordinary Least Squares).

INSIGHTS

  1. The R-squared value measures how well the independent variables explain the variability in the dependent variable.
  2. The R-squared value of 0.839 tells us that the regression fit is good.
  3. The assumptions of the linear regression model are satisfied.
  4. With an adjusted R-squared of 0.838, the model is able to explain ~84% of the variation in the data, which is very good.

The train and test RMSE and MAE are low and comparable. So, our model is not suffering from overfitting.

The MAPE on the test set suggests we can predict within 4.48% of the used price.

Hence, we can conclude the model olsmodel_final is good for prediction as well as inference purposes.

RECOMMENDATIONS

1. The demand for used devices is clearly increasing from year to year.

2. There is a huge demand for all types of used devices, whether tablets or mobile phones.

3. Detailed analysis of the data set shows that there is a wide range of devices with different specifications.

4. There are many options available in the used market, suitable for everyone and for every budget.

5. People have many options to choose from among their favorite brands.

6. Recently released used devices usually cost more.

7. Devices with an operating system other than Android or iOS may be less in demand and cost less.

8. Better camera quality and screen size also have a strong positive impact on the used device market, as they are in high demand.

9. Devices from well-known brands are also doing well in the used device market.

10. Devices which have both 4g and 5g cost more.

11. Internal memory and RAM capacity are also key specifications: more capacity, better price.

The COVID-19 outbreak has had a positive impact on the used device market, as people buy phones and tablets only for immediate needs.

Refurbishing the used devices, providing insurance and offering attractive deals may attract more customers.

If devices with better screen size, better camera, better storage and os from different brands are available at reasonable prices compared to new devices, then the demand for used devices will increase year by year.